Peculiar Genes Selection: A new features selection method to improve classification performances in imbalanced data sets
نویسندگان
چکیده
High-Throughput technologies provide genomic and trascriptomic data that are suitable for biomarker detection for classification purposes. However, the high dimension of the output of such technologies and the characteristics of the data sets analysed represent an issue for the classification task. Here we present a new feature selection method based on three steps to detect class-specific biomarkers in case of high-dimensional data sets. The first step detects the differentially expressed genes according to the experimental conditions tested in the experimental design, the second step filters out the features with low discriminative power and the third step detects the class-specific features and defines the final biomarker as the union of the class-specific features. The proposed procedure is tested on two microarray datasets, one characterized by a strong imbalance between the size of classes and the other one where the size of classes is perfectly balanced. We show that, using the proposed feature selection procedure, the classification performances of a Support Vector Machine on the imbalanced data set reach a 82% whereas other methods do not exceed 73%. Furthermore, in case of perfectly balanced dataset, the classification performances are comparable with other methods. Finally, the Gene Ontology enrichments performed on the signatures selected with the proposed pipeline, confirm the biological relevance of our methodology. The download of the package with the implementation of Peculiar Genes Selection, 'PGS', is available for R users at: http://github.com/mbeccuti/PGS.
منابع مشابه
A hybrid filter-based feature selection method via hesitant fuzzy and rough sets concepts
High dimensional microarray datasets are difficult to classify since they have many features with small number ofinstances and imbalanced distribution of classes. This paper proposes a filter-based feature selection method to improvethe classification performance of microarray datasets by selecting the significant features. Combining the concepts ofrough sets, weighted rough set, fuzzy rough se...
متن کاملA New Framework for Distributed Multivariate Feature Selection
Feature selection is considered as an important issue in classification domain. Selecting a good feature through maximum relevance criterion to class label and minimum redundancy among features affect improving the classification accuracy. However, most current feature selection algorithms just work with the centralized methods. In this paper, we suggest a distributed version of the mRMR featu...
متن کاملA New Hybrid Method for Improving the Performance of Myocardial Infarction Prediction
Abstract Introduction: Myocardial Infarction, also known as heart attack, normally occurs due to such causes as smoking, family history, diabetes, and so on. It is recognized as one of the leading causes of death in the world. Therefore, the present study aimed to evaluate the performance of classification models in order to predict Myocardial Infarction, using a feature selection method tha...
متن کاملDiagnosis of the disease using an ant colony gene selection method based on information gain ratio using fuzzy rough sets
With the advancement of metagenome data mining science has become focused on microarrays. Microarrays are datasets with a large number of genes that are usually irrelevant to the output class; hence, the process of gene selection or feature selection is essential. So, it follows that you can remove redundant genes and increase the speed and accuracy of classification. After applying the gene se...
متن کاملA Novel One Sided Feature Selection Method for Imbalanced Text Classification
The imbalance data can be seen in various areas such as text classification, credit card fraud detection, risk management, web page classification, image classification, medical diagnosis/monitoring, and biological data analysis. The classification algorithms have more tendencies to the large class and might even deal with the minority class data as the outlier data. The text data is one of t...
متن کامل